Bienvenido a la primera práctica de Jupyter Notebook

Esta primera práctica se realizará sobre los datos obtenidos de la página http://insideairbnb.com/get-the-data.html

Para cargar librerías

Antes de realizar un análisis, se debe verificar si se tienen las librerías instaladas, si no se tiene alguna librerìa, se debe proceder a instalarlar.

In [196]:
import os, sys
import numpy as np
import pandas as pd
import pandas_profiling   ####Verificar que esté instalada. Si no está instalada, se puede utilziar el comando conda install -c conda-forge pandas-profiling --y

Configuración del directorio

In [197]:
#%cd C:\Users\USUARIO\Documents\Monitoria_cda_2019_2\Clases_Python\Clase1_Python_ams    

Lectura de los datos

In [198]:
#data= pd.read_csv('listings.csv', sep=',',  error_bad_lines=False)

data=pd.read_csv('listings.csv')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3145: DtypeWarning: Columns (94) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
In [199]:
##Para seleccionar una primera vista de la tabla
data.head()
Out[199]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2818 https://www.airbnb.com/rooms/2818 20190708161114 2019-07-09 Quiet Garden View Room & Super Fast WiFi Quiet Garden View Room & Super Fast WiFi I'm renting a bedroom (room overlooking the ga... Quiet Garden View Room & Super Fast WiFi I'm r... none Indische Buurt ("Indies Neighborhood") is a ne... ... t f strict_14_with_grace_period f f 1 0 1 0 2.09
1 20168 https://www.airbnb.com/rooms/20168 20190708161114 2019-07-09 Studio with private bathroom in the centre 1 Cozy studio on your own private floor, 100% in... For those who like all facets of city life. In... Cozy studio on your own private floor, 100% in... none Located just in between famous central canals.... ... f f strict_14_with_grace_period f f 2 0 2 0 2.45
2 25428 https://www.airbnb.com/rooms/25428 20190708161114 2019-07-09 Lovely apt in City Centre (w.lift) near Jordaan NaN This nicely furnished, newly renovated apt is... This nicely furnished, newly renovated apt is... none NaN ... f f strict_14_with_grace_period f f 2 2 0 0 0.17
3 27886 https://www.airbnb.com/rooms/27886 20190708161114 2019-07-09 Romantic, stylish B&B houseboat in canal district Stylish and romantic houseboat on fantastic hi... For a romantic couple: A beautifully restored ... Stylish and romantic houseboat on fantastic hi... none Central, quiet, safe, clean and beautiful. ... t f strict_14_with_grace_period f f 1 0 1 0 2.14
4 28871 https://www.airbnb.com/rooms/28871 20190708161114 2019-07-09 Comfortable double room NaN In a monumental house right in the center of A... In a monumental house right in the center of A... none NaN ... f f moderate f f 3 0 3 0 2.56

5 rows × 106 columns

Como se puede observar en la vista anterior, la primera columna no tiene nombre, esto es por que el data frame de pandas asigna un índice, el cual nos puede ayudar a realizar algunas consultas, por ejemplo, suponga que se va seleccionar la primera fila, es decir la fila asociada al índice 0. para tal fin, se utiliza df.loc[a:b, "column name"], donde a y b corresponden al rango del índice de las filas y column name el nombre de la variable a seleccionar, si se desea seleccionar todas las variables, se agrega el comando :

In [200]:
data.loc[0:0, : ]
Out[200]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2818 https://www.airbnb.com/rooms/2818 20190708161114 2019-07-09 Quiet Garden View Room & Super Fast WiFi Quiet Garden View Room & Super Fast WiFi I'm renting a bedroom (room overlooking the ga... Quiet Garden View Room & Super Fast WiFi I'm r... none Indische Buurt ("Indies Neighborhood") is a ne... ... t f strict_14_with_grace_period f f 1 0 1 0 2.09

1 rows × 106 columns

Ahora, ¿qué pasaría si no consideraramos 0:0, y por el contrario, tomamos sólo 0.

In [201]:
data.loc[0, : ]
Out[201]:
id                                                                                  2818
listing_url                                            https://www.airbnb.com/rooms/2818
scrape_id                                                                 20190708161114
last_scraped                                                                  2019-07-09
name                                            Quiet Garden View Room & Super Fast WiFi
                                                                  ...                   
calculated_host_listings_count                                                         1
calculated_host_listings_count_entire_homes                                            0
calculated_host_listings_count_private_rooms                                           1
calculated_host_listings_count_shared_rooms                                            0
reviews_per_month                                                                   2.09
Name: 0, Length: 106, dtype: object

Suponga ahora, que se va seleccionar las filas asociadas a las filas 0,1 y 2.

In [202]:
data.loc[0:2, : ]
Out[202]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2818 https://www.airbnb.com/rooms/2818 20190708161114 2019-07-09 Quiet Garden View Room & Super Fast WiFi Quiet Garden View Room & Super Fast WiFi I'm renting a bedroom (room overlooking the ga... Quiet Garden View Room & Super Fast WiFi I'm r... none Indische Buurt ("Indies Neighborhood") is a ne... ... t f strict_14_with_grace_period f f 1 0 1 0 2.09
1 20168 https://www.airbnb.com/rooms/20168 20190708161114 2019-07-09 Studio with private bathroom in the centre 1 Cozy studio on your own private floor, 100% in... For those who like all facets of city life. In... Cozy studio on your own private floor, 100% in... none Located just in between famous central canals.... ... f f strict_14_with_grace_period f f 2 0 2 0 2.45
2 25428 https://www.airbnb.com/rooms/25428 20190708161114 2019-07-09 Lovely apt in City Centre (w.lift) near Jordaan NaN This nicely furnished, newly renovated apt is... This nicely furnished, newly renovated apt is... none NaN ... f f strict_14_with_grace_period f f 2 2 0 0 0.17

3 rows × 106 columns

Ahora, se va a seleccionar únicamente la variable "name" de las tres primeras filas, es decir, de los índices 0, 1 y 2.

In [203]:
data.loc[0:2, 'name']
Out[203]:
0           Quiet Garden View Room & Super Fast WiFi
1       Studio with private bathroom in the centre 1
2    Lovely apt in City Centre (w.lift) near Jordaan
Name: name, dtype: object

A continuación, se va seleccionar las variables "name" y "summary" y las tres primeras filas, es decir, los índices 0, 1 y 2.

In [204]:
data.loc[0:2, ['name', 'summary']]
Out[204]:
name summary
0 Quiet Garden View Room & Super Fast WiFi Quiet Garden View Room & Super Fast WiFi
1 Studio with private bathroom in the centre 1 Cozy studio on your own private floor, 100% in...
2 Lovely apt in City Centre (w.lift) near Jordaan NaN

Finalmente, se va seleccionar únicamente la variable "name", y se considerarán todas las filas.

In [205]:
data.loc[:, 'name']
Out[205]:
0                 Quiet Garden View Room & Super Fast WiFi
1             Studio with private bathroom in the centre 1
2          Lovely apt in City Centre (w.lift) near Jordaan
3        Romantic, stylish B&B houseboat in canal district
4                                  Comfortable double room
                               ...                        
20332    Apartment Amsterdam next to city center (1 per...
20333             Penthouse in Watergraafsmeer (Amsterdam)
20334    Cozy room in Amsterdam with nice (Hidden by Ai...
20335          City apartment near centre, water and beach
20336                    Room in lovely central  apartment
Name: name, Length: 20337, dtype: object

Estudio sobre el conjunto de variables ¿cuál es el tipo de variables? ¿realmente la herramienta lee correctamente los datos?

A continuación, se realiza un primer análisis sobre el tipo de variables, para tal fin, se utiliza la función data.dtypes

In [206]:
variables = data.dtypes

Hay algunas variables para las cuales toca tener mucho cuidado, dado que el tipo no es corresto.

In [207]:
variables[0:50]
Out[207]:
id                                int64
listing_url                      object
scrape_id                         int64
last_scraped                     object
name                             object
summary                          object
space                            object
description                      object
experiences_offered              object
neighborhood_overview            object
notes                            object
transit                          object
access                           object
interaction                      object
house_rules                      object
thumbnail_url                   float64
medium_url                      float64
picture_url                      object
xl_picture_url                  float64
host_id                           int64
host_url                         object
host_name                        object
host_since                       object
host_location                    object
host_about                       object
host_response_time               object
host_response_rate               object
host_acceptance_rate            float64
host_is_superhost                object
host_thumbnail_url               object
host_picture_url                 object
host_neighbourhood               object
host_listings_count             float64
host_total_listings_count       float64
host_verifications               object
host_has_profile_pic             object
host_identity_verified           object
street                           object
neighbourhood                    object
neighbourhood_cleansed           object
neighbourhood_group_cleansed    float64
city                             object
state                            object
zipcode                          object
market                           object
smart_location                   object
country_code                     object
country                          object
latitude                        float64
longitude                       float64
dtype: object
In [208]:
variables[51:100]
Out[208]:
property_type                     object
room_type                         object
accommodates                       int64
bathrooms                        float64
bedrooms                         float64
beds                             float64
bed_type                          object
amenities                         object
square_feet                      float64
price                             object
weekly_price                      object
monthly_price                     object
security_deposit                  object
cleaning_fee                      object
guests_included                    int64
extra_people                      object
minimum_nights                     int64
maximum_nights                     int64
minimum_minimum_nights             int64
maximum_minimum_nights             int64
minimum_maximum_nights             int64
maximum_maximum_nights             int64
minimum_nights_avg_ntm           float64
maximum_nights_avg_ntm           float64
calendar_updated                  object
has_availability                  object
availability_30                    int64
availability_60                    int64
availability_90                    int64
availability_365                   int64
calendar_last_scraped             object
number_of_reviews                  int64
number_of_reviews_ltm              int64
first_review                      object
last_review                       object
review_scores_rating             float64
review_scores_accuracy           float64
review_scores_cleanliness        float64
review_scores_checkin            float64
review_scores_communication      float64
review_scores_location           float64
review_scores_value              float64
requires_license                  object
license                           object
jurisdiction_names                object
instant_bookable                  object
is_business_travel_ready          object
cancellation_policy               object
require_guest_profile_picture     object
dtype: object

A continuación, se va seleccionar un conjunto de variables que van a ser de interés para un análisis preliminar.

In [209]:
data_analisis=data.loc[:, ['id','room_type', 'bathrooms', 'bedrooms', 'bed_type', 'price', 'weekly_price',
       'monthly_price', 'security_deposit', 'city', 'state', 'country','first_review','last_review', 'review_scores_rating', 'number_of_reviews','has_availability', 'availability_30', 'availability_60',
       'availability_90', 'availability_365']]
In [210]:
data_analisis.head()
Out[210]:
id room_type bathrooms bedrooms bed_type price weekly_price monthly_price security_deposit city ... country first_review last_review review_scores_rating number_of_reviews has_availability availability_30 availability_60 availability_90 availability_365
0 2818 Private room 1.5 1.0 Real Bed $59.00 NaN $1,500.00 $200.00 Amsterdam ... Netherlands 2009-03-30 2019-06-28 98.0 262 t 5 21 35 107
1 20168 Private room 1.0 1.0 Real Bed $80.00 NaN NaN NaN Amsterdam ... Netherlands 2010-03-02 2019-07-08 88.0 279 t 0 11 32 140
2 25428 Entire home/apt 1.0 1.0 Real Bed $125.00 $650.00 $2,000.00 $300.00 Amsterdam ... Netherlands 2018-01-21 2019-05-11 100.0 3 t 0 4 6 106
3 27886 Private room 1.0 1.0 Real Bed $150.00 $810.00 $2,500.00 $0.00 Amsterdam ... Netherlands 2012-01-09 2019-07-01 99.0 195 t 0 6 14 74
4 28871 Private room NaN 1.0 Real Bed $75.00 $499.00 $1,956.00 NaN Amsterdam ... Netherlands 2010-08-22 2019-07-02 97.0 277 t 5 7 10 138

5 rows × 21 columns

In [211]:
data_analisis.dtypes
Out[211]:
id                        int64
room_type                object
bathrooms               float64
bedrooms                float64
bed_type                 object
price                    object
weekly_price             object
monthly_price            object
security_deposit         object
city                     object
state                    object
country                  object
first_review             object
last_review              object
review_scores_rating    float64
number_of_reviews         int64
has_availability         object
availability_30           int64
availability_60           int64
availability_90           int64
availability_365          int64
dtype: object

Se va corregir los signos pesos, para tal fin, se utilizará un conjunto de funciones

In [212]:
def eliminar_pesos(x):
    x['monthly_price']=str(x['monthly_price']).replace('$','')
    return x['monthly_price']

def eliminar_pesos_2(x):
    x['weekly_price']=str(x['weekly_price']).replace('$','')
    return x['weekly_price']

def eliminar_pesos_3(x):
    x['price']=str(x['price']).replace('$','')
    return x['price']
In [213]:
data_analisis['monthly_price']=data_analisis.apply(eliminar_pesos,axis=1)
data_analisis['weekly_price']=data_analisis.apply(eliminar_pesos_2,axis=1)
data_analisis['price']=data_analisis.apply(eliminar_pesos_3,axis=1)
In [214]:
data_analisis.head(100)
Out[214]:
id room_type bathrooms bedrooms bed_type price weekly_price monthly_price security_deposit city ... country first_review last_review review_scores_rating number_of_reviews has_availability availability_30 availability_60 availability_90 availability_365
0 2818 Private room 1.5 1.0 Real Bed 59.00 nan 1,500.00 $200.00 Amsterdam ... Netherlands 2009-03-30 2019-06-28 98.0 262 t 5 21 35 107
1 20168 Private room 1.0 1.0 Real Bed 80.00 nan nan NaN Amsterdam ... Netherlands 2010-03-02 2019-07-08 88.0 279 t 0 11 32 140
2 25428 Entire home/apt 1.0 1.0 Real Bed 125.00 650.00 2,000.00 $300.00 Amsterdam ... Netherlands 2018-01-21 2019-05-11 100.0 3 t 0 4 6 106
3 27886 Private room 1.0 1.0 Real Bed 150.00 810.00 2,500.00 $0.00 Amsterdam ... Netherlands 2012-01-09 2019-07-01 99.0 195 t 0 6 14 74
4 28871 Private room NaN 1.0 Real Bed 75.00 499.00 1,956.00 NaN Amsterdam ... Netherlands 2010-08-22 2019-07-02 97.0 277 t 5 7 10 138
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 214531 Private room 1.0 1.0 Real Bed 114.00 nan nan $0.00 Amsterdam ... Netherlands 2011-11-07 2019-05-29 96.0 158 t 0 0 0 223
96 217692 Entire home/apt 1.0 2.0 Real Bed 150.00 nan 2,499.00 $250.00 Amsterdam ... Netherlands 2014-09-17 2019-02-09 97.0 15 t 29 59 83 272
97 219276 Private room 1.0 1.0 Real Bed 265.00 nan nan $500.00 Amsterdam ... Netherlands 2012-07-31 2019-04-21 100.0 2 t 7 37 56 319
98 221922 Private room 1.0 1.0 Real Bed 99.00 nan nan $100.00 Bos en Lommer ... Netherlands 2011-11-07 2016-03-08 96.0 6 t 18 37 37 37
99 221943 Entire home/apt 1.0 2.0 Real Bed 170.00 1,200.00 3,500.00 $200.00 Amsterdam ... Netherlands 2011-10-12 2019-06-24 94.0 197 t 2 11 27 273

100 rows × 21 columns

Para las variables "price", "weekly_price" y "monthly_price" se va a eliminar el caracter "," , para finalmente corregirlo.

In [215]:
def eliminar_coma(x):
    x['monthly_price']=str(x['monthly_price']).replace(',','')
    return x['monthly_price']

def eliminar_coma_2(x):
    x['weekly_price']=str(x['weekly_price']).replace(',','')
    return x['weekly_price']

def eliminar_coma_3(x):
    x['price']=str(x['price']).replace(',','')
    return x['price']
In [216]:
data_analisis['monthly_price']=data_analisis.apply(eliminar_coma,axis=1)
data_analisis['weekly_price']=data_analisis.apply(eliminar_coma,axis=1)
data_analisis['price']=data_analisis.apply(eliminar_coma,axis=1)
In [217]:
data_analisis.head()
Out[217]:
id room_type bathrooms bedrooms bed_type price weekly_price monthly_price security_deposit city ... country first_review last_review review_scores_rating number_of_reviews has_availability availability_30 availability_60 availability_90 availability_365
0 2818 Private room 1.5 1.0 Real Bed 1500.00 1500.00 1500.00 $200.00 Amsterdam ... Netherlands 2009-03-30 2019-06-28 98.0 262 t 5 21 35 107
1 20168 Private room 1.0 1.0 Real Bed nan nan nan NaN Amsterdam ... Netherlands 2010-03-02 2019-07-08 88.0 279 t 0 11 32 140
2 25428 Entire home/apt 1.0 1.0 Real Bed 2000.00 2000.00 2000.00 $300.00 Amsterdam ... Netherlands 2018-01-21 2019-05-11 100.0 3 t 0 4 6 106
3 27886 Private room 1.0 1.0 Real Bed 2500.00 2500.00 2500.00 $0.00 Amsterdam ... Netherlands 2012-01-09 2019-07-01 99.0 195 t 0 6 14 74
4 28871 Private room NaN 1.0 Real Bed 1956.00 1956.00 1956.00 NaN Amsterdam ... Netherlands 2010-08-22 2019-07-02 97.0 277 t 5 7 10 138

5 rows × 21 columns

A pesar de eliminar el caracter ",", el tipo de variable de las variables "price", "weekly_price" y "monthly_price" siguen siendo carácteres.

In [218]:
data_analisis.dtypes
Out[218]:
id                        int64
room_type                object
bathrooms               float64
bedrooms                float64
bed_type                 object
price                    object
weekly_price             object
monthly_price            object
security_deposit         object
city                     object
state                    object
country                  object
first_review             object
last_review              object
review_scores_rating    float64
number_of_reviews         int64
has_availability         object
availability_30           int64
availability_60           int64
availability_90           int64
availability_365          int64
dtype: object

A continuación, se corrige el tipo de datos de las variables "price", "weekly_price" y "monthly_price", para tal fin se utiliza la función astype(tipo_variable) de pandas.

In [219]:
data_analisis["price"]=data_analisis["price"].astype(float)
data_analisis["weekly_price"]=data_analisis["price"].astype(float)
data_analisis["monthly_price"]=data_analisis["price"].astype(float)
In [220]:
data_analisis.head()
Out[220]:
id room_type bathrooms bedrooms bed_type price weekly_price monthly_price security_deposit city ... country first_review last_review review_scores_rating number_of_reviews has_availability availability_30 availability_60 availability_90 availability_365
0 2818 Private room 1.5 1.0 Real Bed 1500.0 1500.0 1500.0 $200.00 Amsterdam ... Netherlands 2009-03-30 2019-06-28 98.0 262 t 5 21 35 107
1 20168 Private room 1.0 1.0 Real Bed NaN NaN NaN NaN Amsterdam ... Netherlands 2010-03-02 2019-07-08 88.0 279 t 0 11 32 140
2 25428 Entire home/apt 1.0 1.0 Real Bed 2000.0 2000.0 2000.0 $300.00 Amsterdam ... Netherlands 2018-01-21 2019-05-11 100.0 3 t 0 4 6 106
3 27886 Private room 1.0 1.0 Real Bed 2500.0 2500.0 2500.0 $0.00 Amsterdam ... Netherlands 2012-01-09 2019-07-01 99.0 195 t 0 6 14 74
4 28871 Private room NaN 1.0 Real Bed 1956.0 1956.0 1956.0 NaN Amsterdam ... Netherlands 2010-08-22 2019-07-02 97.0 277 t 5 7 10 138

5 rows × 21 columns

Finalmente, se obtiene el número de variables y de filas del dataframe

In [221]:
data_analisis.shape
Out[221]:
(20337, 21)

El primer análisis sobre la tabla.

Para realizar un primer análisis exploratorio sobre la tabla, se utiliza la función pandas_profiling, el cual permite evidenciar el tipo de variable

In [222]:
pandas_profiling.ProfileReport(data_analisis)



Out[222]:

Supongamos, ahora, que vamos a realizar un análisis sobre "bed_type== Real Bed"., entonces:

In [223]:
consulta_1=data_analisis[data_analisis.bed_type=="Real Bed"]
consulta_1.head()
Out[223]:
id room_type bathrooms bedrooms bed_type price weekly_price monthly_price security_deposit city ... country first_review last_review review_scores_rating number_of_reviews has_availability availability_30 availability_60 availability_90 availability_365
0 2818 Private room 1.5 1.0 Real Bed 1500.0 1500.0 1500.0 $200.00 Amsterdam ... Netherlands 2009-03-30 2019-06-28 98.0 262 t 5 21 35 107
1 20168 Private room 1.0 1.0 Real Bed NaN NaN NaN NaN Amsterdam ... Netherlands 2010-03-02 2019-07-08 88.0 279 t 0 11 32 140
2 25428 Entire home/apt 1.0 1.0 Real Bed 2000.0 2000.0 2000.0 $300.00 Amsterdam ... Netherlands 2018-01-21 2019-05-11 100.0 3 t 0 4 6 106
3 27886 Private room 1.0 1.0 Real Bed 2500.0 2500.0 2500.0 $0.00 Amsterdam ... Netherlands 2012-01-09 2019-07-01 99.0 195 t 0 6 14 74
4 28871 Private room NaN 1.0 Real Bed 1956.0 1956.0 1956.0 NaN Amsterdam ... Netherlands 2010-08-22 2019-07-02 97.0 277 t 5 7 10 138

5 rows × 21 columns

Supongamos, ahora, que vamos a realizar un análisis sobre "bed_type== Real Bed" y number_of_reviews >23, entonces:

In [224]:
consulta_2=data_analisis[(data_analisis.bed_type=="Real Bed") & (data_analisis.number_of_reviews >23)]
In [225]:
consulta_2.head()
Out[225]:
id room_type bathrooms bedrooms bed_type price weekly_price monthly_price security_deposit city ... country first_review last_review review_scores_rating number_of_reviews has_availability availability_30 availability_60 availability_90 availability_365
0 2818 Private room 1.5 1.0 Real Bed 1500.0 1500.0 1500.0 $200.00 Amsterdam ... Netherlands 2009-03-30 2019-06-28 98.0 262 t 5 21 35 107
1 20168 Private room 1.0 1.0 Real Bed NaN NaN NaN NaN Amsterdam ... Netherlands 2010-03-02 2019-07-08 88.0 279 t 0 11 32 140
3 27886 Private room 1.0 1.0 Real Bed 2500.0 2500.0 2500.0 $0.00 Amsterdam ... Netherlands 2012-01-09 2019-07-01 99.0 195 t 0 6 14 74
4 28871 Private room NaN 1.0 Real Bed 1956.0 1956.0 1956.0 NaN Amsterdam ... Netherlands 2010-08-22 2019-07-02 97.0 277 t 5 7 10 138
5 29051 Private room 1.0 1.0 Real Bed 1435.0 1435.0 1435.0 NaN Amsterdam ... Netherlands 2011-03-16 2019-06-29 95.0 428 t 2 7 9 173

5 rows × 21 columns

A continuación, se ejecuta pandas profiling sobre el dataframe denominado consulta_2

In [226]:
pandas_profiling.ProfileReport(consulta_2)



Out[226]:

Validaciones con duplicados

Las validaciones con duplicados se pueden realizar con el siguiente código:

In [227]:
consulta_2.duplicated().sum()
Out[227]:
0
In [228]:
##Para eliminar duplicados (en caso que existan)
consulta_2=consulta_2.drop_duplicates()

Preguntas.

  1. Realizar un "enriquecimiento" de información, utilizando los datos que se encuentra en la página http://insideairbnb.com/get-the-data.html, para tal fin, utilice la funciòn merge (http://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/).
  2. Seleccione un conjunto de variables que pueda ser de interés.
  3. Teniendo en cuenta los resultados del ítem 2, realizar un análisis y el tratamiendo adecuado en función del tipo de variables, datos faltantes y datos duplicados.
  4. Formule 4 hipótesis que se puedan abarcar sobre el conjunto de datos resultantes en el punto anterior, y de ser prosible dar respuesta a estas utilizando pandas profiling.

Punto 1.

Voy a unir el dataframe resultante de la consulta 2 sin valores duplicados con otro dataframe de listings con más información.

In [238]:
nuevo = pd.read_csv("listings2.csv")
data_analisis = data_analisis.drop_duplicates()
final = pd.merge(data_analisis,nuevo,how='inner',on='id')
final
Out[238]:
id room_type_x bathrooms bedrooms bed_type price_x weekly_price monthly_price security_deposit city ... latitude longitude room_type_y price_y minimum_nights number_of_reviews_y last_review_y reviews_per_month calculated_host_listings_count availability_365_y
0 2818 Private room 1.5 1.0 Real Bed 1500.0 1500.0 1500.0 $200.00 Amsterdam ... 52.36575 4.94142 Private room 59 3 278 2020-02-14 2.00 1 125
1 20168 Private room 1.0 1.0 Real Bed NaN NaN NaN NaN Amsterdam ... 52.36509 4.89354 Private room 209 1 340 2020-04-09 2.67 3 3
2 25428 Entire home/apt 1.0 1.0 Real Bed 2000.0 2000.0 2000.0 $300.00 Amsterdam ... 52.37297 4.88339 Entire home/apt 125 14 5 2020-02-09 0.16 1 188
3 27886 Private room 1.0 1.0 Real Bed 2500.0 2500.0 2500.0 $0.00 Amsterdam ... 52.38761 4.89188 Private room 135 2 219 2020-07-25 2.09 1 183
4 28871 Private room NaN 1.0 Real Bed 1956.0 1956.0 1956.0 NaN Amsterdam ... 52.36719 4.89092 Private room 75 2 334 2020-08-16 2.74 2 348
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14958 36476715 Entire home/apt 1.0 1.0 Real Bed NaN NaN NaN $0.00 Amsterdam ... 52.36305 4.90461 Entire home/apt 135 3 3 2019-10-06 0.24 2 0
14959 36480247 Entire home/apt 1.0 1.0 Real Bed NaN NaN NaN NaN Amsterdam ... 52.36217 4.86017 Entire home/apt 170 2 5 2019-12-08 0.39 1 23
14960 36480673 Entire home/apt 1.5 2.0 Real Bed NaN NaN NaN $0.00 Amsterdam ... 52.38651 4.88689 Entire home/apt 225 2 4 2020-08-04 0.32 1 105
14961 36481981 Entire home/apt 2.5 3.0 Real Bed NaN NaN NaN NaN Amsterdam ... 52.41608 4.89954 Entire home/apt 100 4 2 2020-08-05 0.16 2 0
14962 36498483 Entire home/apt 1.5 2.0 Real Bed NaN NaN NaN $0.00 Amsterdam ... 52.35096 5.00173 Entire home/apt 90 2 2 2019-09-19 0.16 1 6

14963 rows × 36 columns

Punto 2.

In [239]:
final.columns #Quiero ver las columnas para escoger las que me interesan además de las de la consulta 2
Out[239]:
Index(['id', 'room_type_x', 'bathrooms', 'bedrooms', 'bed_type', 'price_x',
       'weekly_price', 'monthly_price', 'security_deposit', 'city', 'state',
       'country', 'first_review', 'last_review_x', 'review_scores_rating',
       'number_of_reviews_x', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365_x', 'name',
       'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
       'latitude', 'longitude', 'room_type_y', 'price_y', 'minimum_nights',
       'number_of_reviews_y', 'last_review_y', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365_y'],
      dtype='object')
In [240]:
no_interes=['bathrooms','reviews_per_month','review_scores_rating','calculated_host_listings_count','first_review','last_review_x','last_review_y','has_availability','availability_30','availability_60','availability_90','availability_90','availability_365_x','availability_365_y','security_deposit','city','state','country']
final.drop(columns=no_interes,inplace=True)

Punto 3.

In [241]:
final.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14963 entries, 0 to 14962
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   14963 non-null  int64  
 1   room_type_x          14963 non-null  object 
 2   bedrooms             14956 non-null  float64
 3   bed_type             14963 non-null  object 
 4   price_x              1092 non-null   float64
 5   weekly_price         1092 non-null   float64
 6   monthly_price        1092 non-null   float64
 7   number_of_reviews_x  14963 non-null  int64  
 8   name                 14930 non-null  object 
 9   host_id              14963 non-null  int64  
 10  host_name            14908 non-null  object 
 11  neighbourhood_group  0 non-null      float64
 12  neighbourhood        14963 non-null  object 
 13  latitude             14963 non-null  float64
 14  longitude            14963 non-null  float64
 15  room_type_y          14963 non-null  object 
 16  price_y              14963 non-null  int64  
 17  minimum_nights       14963 non-null  int64  
 18  number_of_reviews_y  14963 non-null  int64  
dtypes: float64(7), int64(6), object(6)
memory usage: 2.3+ MB

Las columnas name y host_name cuentan con valores nulos, sin embargo, se pueden eliminar pues con el id y host_id es suficiente. La columna neighbourhood_group parece que no pudo ser leída correctamente por lo que también la voy a eliminar. También las columnas price_x, weekly_price y monthly_price tienen una gran cantidad de valores nulos, voy a eliminarlas pues parece que el price del dataframe nuevo 'price_y' contiene todos los precios. Finalmente, voy a eliminar las columnas repetidas (room_type_y y number_of_reviews_y)

In [242]:
a_remover= ['name','host_name','neighbourhood_group','price_x','weekly_price','monthly_price','room_type_y','number_of_reviews_y']
final.drop(columns=a_remover,inplace=True)
In [243]:
final.columns
Out[243]:
Index(['id', 'room_type_x', 'bedrooms', 'bed_type', 'number_of_reviews_x',
       'host_id', 'neighbourhood', 'latitude', 'longitude', 'price_y',
       'minimum_nights'],
      dtype='object')
In [244]:
final.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14963 entries, 0 to 14962
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   14963 non-null  int64  
 1   room_type_x          14963 non-null  object 
 2   bedrooms             14956 non-null  float64
 3   bed_type             14963 non-null  object 
 4   number_of_reviews_x  14963 non-null  int64  
 5   host_id              14963 non-null  int64  
 6   neighbourhood        14963 non-null  object 
 7   latitude             14963 non-null  float64
 8   longitude            14963 non-null  float64
 9   price_y              14963 non-null  int64  
 10  minimum_nights       14963 non-null  int64  
dtypes: float64(3), int64(5), object(3)
memory usage: 1.4+ MB

Punto 4.

Hipótesis:

  1. Los vecindarios más comunes de los hospedajes son los más recomendados en internet https://www.thebrokebackpacker.com/where-to-stay-in-amsterdam-netherlands/ https://www.touropia.com/where-to-stay-in-amsterdam-best-neighborhoods-hotels/
  2. Poca gente tiende a ofrecer habitaciones compartidas
  3. La gente que ofrece alojamiento en Amsterdam ofrece mínimo 3 días de alojamiento.
  4. El precio del alojamiento está entre 400 y 1200.
In [245]:
pandas_profiling.ProfileReport(final)



Out[245]:

Respuestas:

  1. Los vecindarios más comunes de los hospedajes son los más recomendados en internet Validada https://www.thebrokebackpacker.com/where-to-stay-in-amsterdam-netherlands/ https://www.touropia.com/where-to-stay-in-amsterdam-best-neighborhoods-hotels/
  2. Poca gente tiende a ofrecer habitaciones compartidas Validada, solo 48 personas.
  3. La mayoría de la gente que ofrece alojamiento en Amsterdam ofrece mínimo 3 noches de alojamiento. Refutada, la mayoría de personas (39,1%) ofrecen 2 días.
  4. El precio del alojamiento está entre 400 y 1200. Refutada, está entre 5 y 8000